Introduction

Let’s explore white wines!!

White wines get no love. At least in my experience, friends and dinner dates are more likely to jump for the Cabs and Pinots than for a nice Sauvignon Blanc or Riesling. But perhaps that is more of a factor of approachability, and a lack of understanding. So let’s dive into the data, and maybe we can unearth some of the secrets of what makes a good white wine, and understand why white wines are amazing!

Univariate Plots Section

First I’m going to explore the structure:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

There are roughly 5,000 wines with 12 measured variables each (the thirteenth column, X, is just a row index). I think a good starting point is looking at the distribution of quality. As the only integer variable besides the index, and our likely dependent variable, it’s worth getting an idea of how quality is distributed in this dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

There are many mediocre wines. But we also have 5 great wines (rated 9) and a good mix of decent wines in the 7’s and 8’s.
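To put numbers on "many mediocre wines", here is a small sketch that works only from the counts printed above (it does not need the full dataset). The variable names are my own.

```r
# Counts per quality score, copied from the table above
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

total <- sum(quality_counts)

# Percentage of wines at each score
share_pct <- round(100 * quality_counts / total, 1)
print(share_pct)

# Fraction of wines rated 5 or 6
middle_share <- sum(quality_counts[c("5", "6")]) / total

# Mean quality rebuilt from the counts; it should agree with the summary
scores <- as.numeric(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / total
print(round(mean_quality, 3))
```

Roughly three quarters of the wines are rated 5 or 6, while the five wines rated 9 are about 0.1% of the sample.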

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

For the sake of comparison, the density of water is 1.0 g/cm^3, ethanol (alcohol) is 0.7893 g/cm^3, and sugar (glucose and fructose) ranges from 1.54 to 1.69 g/cm^3. Because ethanol is lighter than water, we would expect most wines to be slightly less dense than water, hence the densities below 1.0. The few wines that are at or above 1.0 should have a higher sugar content. This could be fun to look at. With that in mind, I want to look at the alcohol and sugar data.
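The reasoning above can be sketched as a crude mixing calculation. This is only a back-of-the-envelope model under an assumption of my own: volumes simply add up (real ethanol-water mixtures contract slightly) and everything that is not ethanol or sugar is water. The function `mix_density` is hypothetical, so trust the direction of the effect, not the exact values.

```r
# Densities in g/cm^3, taken from the paragraph above
rho_water   <- 1.0
rho_ethanol <- 0.7893
rho_sugar   <- 1.54   # glucose, the lower end of the quoted range

# Hypothetical helper: volume-weighted average density (ideal-mixing assumption)
mix_density <- function(abv_percent, sugar_g_per_l) {
  ethanol_frac <- abv_percent / 100
  # cm^3 of sugar per 1000 cm^3 of wine, expressed as a volume fraction
  sugar_frac <- (sugar_g_per_l / rho_sugar) / 1000
  water_frac <- 1 - ethanol_frac - sugar_frac
  ethanol_frac * rho_ethanol + sugar_frac * rho_sugar + water_frac * rho_water
}

# More alcohol lowers the density, more sugar raises it
print(mix_density(abv_percent = 9,  sugar_g_per_l = 2))
print(mix_density(abv_percent = 12, sugar_g_per_l = 2))
print(mix_density(abv_percent = 10, sugar_g_per_l = 20))
```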

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol percentage is pretty straightforward. It’s telling us how much alcohol is in the wine. The median and mean are 10.40 and 10.51 respectively, and the distribution is skewed to the right, with the majority of wines in the 9-11 range.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual Sugar is sugar that has not been fermented by the yeast and bacteria into alcohol and other compounds. I set up a second graph to look at the tail, and it appears there are not that many outliers. I would expect these high residual sugar wines to also have the higher densities as graphed above.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH is just telling us about the acidity of the wine. The lower the number, the more acidic the wine. I would expect this to be highly correlated with fixed and volatile acidity, as well as citric acid.
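Since pH is a base-10 logarithmic scale (pH = -log10 of the hydrogen ion concentration), small differences in pH are bigger than they look. A quick sketch using the minimum and maximum from the summary above (the helper name `h_conc` is my own):

```r
# Hydrogen ion concentration (mol/L) implied by a pH value
h_conc <- function(pH) 10^(-pH)

ph_min <- 2.72  # most acidic wine in the summary above
ph_max <- 3.82  # least acidic wine in the summary above

# How many times more hydrogen ions the most acidic wine has
acidity_ratio <- h_conc(ph_min) / h_conc(ph_max)
print(round(acidity_ratio, 1))
```

So the most acidic wine in the sample has roughly 12 to 13 times the hydrogen ion concentration of the least acidic one, even though the pH values differ by only 1.1.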

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed Acidity is the measure in grams per liter of tartaric acid in the wine (fun fact: 1 decimeter cubed is equal to the volume of a liter). It is called a fixed acid, or nonvolatile, because it is difficult to separate from the wine. Although there are many different acids that contribute to the taste of the wine, such as malic and succinic, the paper authors chose to test for tartaric acid. Wines typically have more malic acid. Interestingly, wines from cool climates are typically higher in acidity than those from warmer places.

Anyway, it looks like the fixed acidity is typically in the 6.0 - 8.0 g/dm^3 range.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

There is much less volatile acidity in these wines, which is a good thing. The researchers were testing for the presence of acetic acid, which is typically produced by acetic acid bacteria that convert alcohol and glucose into acetic acid. Acetic acid is the acid in vinegar. There is usually some acetic acid naturally present due to the byproducts of natural yeast and bacteria that live on the grapes. In higher concentrations, acetic acid is a sign of spoilage, meaning the fermentation may not have gone well. Typically, winemakers will add potassium sulphate, which we’ll see more of later, to introduce sulfur dioxide (free sulfur dioxide), which acts as a preservative by killing these bacteria. As it is, most of this sample sits well below 0.5 g/dm^3 (the third quartile is 0.32), although the maximum is 1.10. I don’t expect this variable to be very important to the wine’s quality, as the sensory threshold is roughly 0.6 to 0.9 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid, as far as the internet is telling me, is an additive for creating a “fresh” or “crisp” taste to the wines. It looks like most of the wines have a concentration of around 0.3 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Interestingly, there is some salt present in the wines. I believe this is a natural byproduct. The grapes produce salt as they grow. For the most part, the concentrations seem fairly minor, so it’ll be interesting to see if this has an impact on the quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Potassium Sulphate is added to wine as a preservative. Fun fact: there is no scientific evidence that the sulphites present in wine (and ALL wines have sulphites present) cause headaches. Anyway, potassium sulphate dissolves in the wine, making both free and bound sulfur dioxide. Free sulfur dioxide in high concentrations contributes to a gassy smell, whereas bound sulfur dioxide binds to yeast and bacteria, acts as an antioxidant, and binds to heavy compounds to preserve and mellow out the wine. The relevant charts for both are below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

I think it’s good that, in general, we see total sulfur dioxide in concentrations more than free sulfur dioxide. Other than that, I don’t have much else to add here. It’s going to be more interesting to see the relationship between the different variables, and how they relate to quality.

Univariate Analysis

There are 4,898 different white wines with an integer variable quality and 11 continuous variables testing for the presence of various physical compounds in the wine. Of those variables, I think it’s fair to group certain variables with each other:

I expect chlorides to be a non-factor when it comes to quality, due to the relatively low concentrations in all the wines. I may create some new variables out of the groups above, depending on the results of the bivariate analysis below.

For the most part, the data appears to be fairly clean, so I’m not expecting any issues with moving forward.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.023       0.289
## volatile.acidity            -0.023            1.000      -0.149
## citric.acid                  0.289           -0.149       1.000
## residual.sugar               0.089            0.064       0.094
## chlorides                    0.023            0.071       0.114
## free.sulfur.dioxide         -0.049           -0.097       0.094
## total.sulfur.dioxide         0.091            0.089       0.121
## density                      0.265            0.027       0.150
## pH                          -0.426           -0.032      -0.164
## sulphates                   -0.017           -0.036       0.062
## alcohol                     -0.121            0.068      -0.076
## quality                     -0.114           -0.195      -0.009
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.089     0.023              -0.049
## volatile.acidity              0.064     0.071              -0.097
## citric.acid                   0.094     0.114               0.094
## residual.sugar                1.000     0.089               0.299
## chlorides                     0.089     1.000               0.101
## free.sulfur.dioxide           0.299     0.101               1.000
## total.sulfur.dioxide          0.401     0.199               0.616
## density                       0.839     0.257               0.294
## pH                           -0.194    -0.090              -0.001
## sulphates                    -0.027     0.017               0.059
## alcohol                      -0.451    -0.360              -0.250
## quality                      -0.098    -0.210               0.008
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.091   0.265 -0.426    -0.017  -0.121
## volatile.acidity                    0.089   0.027 -0.032    -0.036   0.068
## citric.acid                         0.121   0.150 -0.164     0.062  -0.076
## residual.sugar                      0.401   0.839 -0.194    -0.027  -0.451
## chlorides                           0.199   0.257 -0.090     0.017  -0.360
## free.sulfur.dioxide                 0.616   0.294 -0.001     0.059  -0.250
## total.sulfur.dioxide                1.000   0.530  0.002     0.135  -0.449
## density                             0.530   1.000 -0.094     0.074  -0.780
## pH                                  0.002  -0.094  1.000     0.156   0.121
## sulphates                           0.135   0.074  0.156     1.000  -0.017
## alcohol                            -0.449  -0.780  0.121    -0.017   1.000
## quality                            -0.175  -0.307  0.099     0.054   0.436
##                      quality
## fixed.acidity         -0.114
## volatile.acidity      -0.195
## citric.acid           -0.009
## residual.sugar        -0.098
## chlorides             -0.210
## free.sulfur.dioxide    0.008
## total.sulfur.dioxide  -0.175
## density               -0.307
## pH                     0.099
## sulphates              0.054
## alcohol                0.436
## quality                1.000

So the highest correlations are between density and both residual sugar (0.839) and alcohol (-0.780), which makes sense as I explained above. Free Sulfur Dioxide is decently correlated with Total Sulfur Dioxide (0.616), although I would have expected a much stronger relationship. And, for the most part, nothing is strongly correlated, positively or negatively, with quality, or with each other. This might make the rest of the analysis kind of difficult.
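Picking the strongest pairs out of a wide correlation matrix by eye is error-prone, so here is a small sketch of how one might rank them. To keep it self-contained it uses a 4 x 4 excerpt typed in from the output above instead of the full matrix; the object names are my own.

```r
vars <- c("residual.sugar", "density", "alcohol", "quality")

# Excerpt of the correlation matrix, copied from the output above
cors <- matrix(c(
   1.000,  0.839, -0.451, -0.098,
   0.839,  1.000, -0.780, -0.307,
  -0.451, -0.780,  1.000,  0.436,
  -0.098, -0.307,  0.436,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cors), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[upper.tri(cors)]
)

# Sort by absolute correlation, strongest first
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```

On the full dataset the same steps should work on the whole matrix, for example starting from `cors <- cor(wine)` with the index column dropped (assuming the data frame is called `wine`, as in the model calls later on).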

This ggplot does a good job of showing just how uncorrelated a bunch of the variables are. I am going to try removing some outlying values, and see if that makes any of the variables more correlated.

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 775   775           9.1             0.27        0.45           10.6
## 821   821           6.6             0.36        0.29            1.6
## 828   828           7.4             0.24        0.36            2.0
## 877   877           6.9             0.36        0.34            4.2
## 1606 1606           7.1             0.26        0.49            2.2
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 775      0.035                  28                  124 0.99700 3.20
## 821      0.021                  24                   85 0.98965 3.41
## 828      0.031                  27                  139 0.99055 3.28
## 877      0.018                  57                  119 0.98980 3.28
## 1606     0.032                  31                  113 0.99030 3.37
##      sulphates alcohol quality
## 775       0.46    10.4       9
## 821       0.61    12.4       9
## 828       0.48    12.5       9
## 877       0.36    12.7       9
## 1606      0.42    12.9       9
##    99.9% 
## 1.002466
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.027       0.292
## volatile.acidity            -0.027            1.000      -0.162
## citric.acid                  0.292           -0.162       1.000
## residual.sugar               0.085            0.045       0.087
## chlorides                    0.024            0.070       0.118
## free.sulfur.dioxide         -0.049           -0.102       0.103
## total.sulfur.dioxide         0.087            0.084       0.123
## density                      0.268            0.002       0.152
## pH                          -0.429           -0.034      -0.167
## sulphates                   -0.020           -0.038       0.066
## alcohol                     -0.123            0.066      -0.088
## quality                     -0.110           -0.194      -0.011
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.085     0.024              -0.049
## volatile.acidity              0.045     0.070              -0.102
## citric.acid                   0.087     0.118               0.103
## residual.sugar                1.000     0.089               0.324
## chlorides                     0.089     1.000               0.103
## free.sulfur.dioxide           0.324     0.103               1.000
## total.sulfur.dioxide          0.415     0.201               0.610
## density                       0.832     0.261               0.320
## pH                           -0.201    -0.090              -0.006
## sulphates                    -0.029     0.016               0.058
## alcohol                      -0.463    -0.360              -0.259
## quality                      -0.100    -0.211               0.025
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.087   0.268 -0.429    -0.020  -0.123
## volatile.acidity                    0.084   0.002 -0.034    -0.038   0.066
## citric.acid                         0.123   0.152 -0.167     0.066  -0.088
## residual.sugar                      0.415   0.832 -0.201    -0.029  -0.463
## chlorides                           0.201   0.261 -0.090     0.016  -0.360
## free.sulfur.dioxide                 0.610   0.320 -0.006     0.058  -0.259
## total.sulfur.dioxide                1.000   0.550  0.001     0.132  -0.458
## density                             0.550   1.000 -0.100     0.073  -0.806
## pH                                  0.001  -0.100  1.000     0.156   0.121
## sulphates                           0.132   0.073  0.156     1.000  -0.018
## alcohol                            -0.458  -0.806  0.121    -0.018   1.000
## quality                            -0.165  -0.317  0.099     0.056   0.438
##                      quality
## fixed.acidity         -0.110
## volatile.acidity      -0.194
## citric.acid           -0.011
## residual.sugar        -0.100
## chlorides             -0.211
## free.sulfur.dioxide    0.025
## total.sulfur.dioxide  -0.165
## density               -0.317
## pH                     0.099
## sulphates              0.056
## alcohol                0.438
## quality                1.000

I don’t think there’s an appreciable difference in correlation as a result of removing some of the outlying values. I am going to proceed by using the original dataset.
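For reference, trimming at the 99.9th percentile of density and re-checking a correlation might look like the sketch below. The data here is a made-up toy data frame (not the wine data), and I am assuming the trimming was done on density because the cutoff printed above (1.002466) falls in its range.

```r
set.seed(42)

# Toy stand-in for the wine data: 1997 ordinary rows plus 3 extreme densities
toy <- data.frame(
  density = c(rnorm(1997, mean = 0.994, sd = 0.003), 1.0100, 1.0103, 1.0390),
  alcohol = rnorm(2000, mean = 10.5, sd = 1.2)
)

# 99.9th percentile of density, then keep only rows at or below it
cutoff  <- quantile(toy$density, probs = 0.999)
trimmed <- subset(toy, density <= cutoff)

# Compare a correlation before and after trimming
cor_before <- cor(toy$density, toy$alcohol)
cor_after  <- cor(trimmed$density, trimmed$alcohol)

print(cutoff)
print(nrow(toy) - nrow(trimmed))  # number of rows removed
```

A 99.9% cutoff only removes about one row in a thousand, so it is not surprising that the correlations barely move.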

I’d like to start by looking at the relationship between fixed and volatile acidity, and free and total sulfur dioxide.

This graph shows no discernible relationship between Fixed and Volatile Acidity. I thought that the two might be related, seeing as they’re both measuring acidity, but that doesn’t seem to be the case. Let’s move on to sulfur dioxide.

I set limits on the x axis to expand the graph slightly, which really does a good job of showing the relationship between free and total sulfur dioxide. I would have expected this to be more linear, since free sulfur dioxide is a part of total sulfur dioxide, so I’m a bit surprised to see how much the sulfur dioxide levels vary between wines.

I’d like to look at the relationship between residual sugar, alcohol and density, but I think I’m going to save that till the multivariate analysis section. Instead, I’m going to explore the relationship of some of the variables with quality. First alcohol.

Interestingly enough, it almost looks like the less alcohol a wine has, the more likely it is going to be rated lower. I wonder if that is a factor of taste, or if the raters were displeased by drinking “weaker” wines. As a sort of analogue, let’s look at density with quality.

OK, so for the most part, I think a similar pattern has emerged: higher quality wines are generally less dense. Most wines are in the 0.99 to 1.00 range, and I’m not sure what this translates to taste-wise. This is almost the same conclusion as above (higher alcohol), just restated, since the more alcohol a wine has, the less dense it will be.

So, I did a little more research, and fixed acidity (tartaric acid in this case) is what gives a wine its sourness. I wanted to see if there was anything we could understand about the quality from its sourness. But it looks like the sourness of wines comes in different levels, and there doesn’t seem to be anything that really distinguishes higher quality wines from the lesser ones. Let’s look at volatile acidity levels too, as a proxy for spoilage.

In general, it appears as though the same trend emerges: all wines at all qualities display pretty varying levels of volatile acidity.

Looking at sweetness, it appears as though most wines are relatively not sweet. The threshold that distinguishes a dry wine is about 5 g/dm^3. There are a lot of wines easily below this (as represented by the bold areas in the 0 - 5 range), but also a healthy amount of wines above. However, it does appear as though there is a cap, almost, with increasing levels of quality linked to lower caps on residual sugar. As an example, the cap for a 6 quality wine looks to be around 18, with a cap of 15 for 7 quality, and a cap at 14 for 8 quality. There are not enough data points for the 9th level of quality to make any notes.

I wanted to see if there was any pattern between the sweetness and the sourness of wines, but they seem to be mostly uncorrelated.

Interestingly, there almost seems to be a channel with regards to free and total sulfur dioxide in the wine. The density of the points almost makes it seem as though the sweet spot for sulfur dioxide presence is around 25-50 mg/dm^3 free and 100-150 mg/dm^3 total sulfur dioxide.

This sort of shows what I’m talking about.

With the levels so low, and so heavily bunched at the bottom, I don’t think there’s much to learn here.

If we really stretch the boxplot’s y axis, it looks like the less salt content, the better the wine.

Citric Acid, weirdly, has an almost similar dynamic to quality as sulfur dioxide, where there is an almost optimal amount at around 0.33 or so.

And finally, just to see the results, pH and sulphates look completely uncorrelated with quality.

Bivariate Analysis

I would really like to see more examples of 3, 4, 8, and 9 quality wines. I think that would make it a lot easier to see the relationships between quality and the other variables. That being said, alcohol percentage and density do a good job of telling you about the quality. As for the other variables, I think there are subtle relationships, but it’s hard to tell if they are meaningful, especially considering the low correlations, and the lack of data for higher and lower quality levels.

Multivariate Plots Section

Alright, so adding alcohol as a color makes it really apparent how the higher quality wines typically have higher alcohol contents.

I really wanted to see sugar and acidity colored by alcohol. There’s a subtle relationship where the less sugar and acidity, the more alcohol. The sugar part makes sense, meaning the yeast did its job well. But the acidity part is really interesting here. I also colored by quality, but there was absolutely no pattern.

I think this does a fantastic job of showing the relationship between residual sugar, alcohol, and density.

I’m not sure there’s much of anything here.

I think this relationship is pretty apparent, but the more sulphates in the wine, the more free sulfur dioxide, and consequently, the more total sulfur dioxide. Nothing of note here.

Nothing emerges here either.

This does a good job of showing the relationship between density, alcohol, and quality. But since alcohol percentage and density are so closely related (in the physical sense), I don’t think this graph is very helpful.

Not helpful.

Based on the plots, I thought I might take a stab at creating a linear model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + density, data = wine)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wine)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides, 
##     data = wine)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity, data = wine)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity + volatile.acidity, data = wine)
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity + volatile.acidity + free.sulfur.dioxide, data = wine)
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity + volatile.acidity + free.sulfur.dioxide + 
##     total.sulfur.dioxide, data = wine)
## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity + volatile.acidity + free.sulfur.dioxide + 
##     total.sulfur.dioxide + sulphates, data = wine)
## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
##     fixed.acidity + volatile.acidity + free.sulfur.dioxide + 
##     total.sulfur.dioxide + sulphates + pH, data = wine)
## 
## ================================================================================================================================================
##                            m1          m2          m3          m4          m5          m6          m7          m8          m9          m10      
## ------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            2.582***  -22.492***   90.313***   87.563***   72.800***   51.143***   50.940***   49.162***   74.922***   149.901***  
##                         (0.098)     (6.165)    (12.374)    (12.392)    (14.224)    (13.784)    (13.741)    (14.268)    (14.880)     (18.760)    
##   alcohol                0.313***    0.360***    0.246***    0.237***    0.252***    0.306***    0.313***    0.313***    0.282***     0.194***  
##                         (0.009)     (0.015)     (0.018)     (0.018)     (0.020)     (0.019)     (0.019)     (0.019)     (0.020)      (0.024)    
##   density                           24.728***  -87.886***  -84.931***  -69.981***  -48.106***  -48.147***  -46.346**   -72.393***  -149.987***  
##                                     (6.079)    (12.317)    (12.340)    (14.222)    (13.783)    (13.740)    (14.280)    (14.907)     (19.029)    
##   residual.sugar                                 0.053***    0.052***    0.046***    0.044***    0.041***    0.040***    0.050***     0.081***  
##                                                 (0.005)     (0.005)     (0.006)     (0.005)     (0.005)     (0.006)     (0.006)      (0.008)    
##   chlorides                                                 -1.776**    -1.852***   -0.794      -0.926      -0.922      -0.852       -0.234     
##                                                             (0.555)     (0.556)     (0.540)     (0.539)     (0.539)     (0.537)      (0.543)    
##   fixed.acidity                                                         -0.033*     -0.049**    -0.042**    -0.042**    -0.026        0.066**   
##                                                                         (0.015)     (0.015)     (0.015)     (0.015)     (0.015)      (0.021)    
##   volatile.acidity                                                                  -2.064***   -1.993***   -1.983***   -1.939***    -1.868***  
##                                                                                     (0.110)     (0.110)     (0.112)     (0.112)      (0.112)    
##   free.sulfur.dioxide                                                                            0.004***    0.004***    0.004***     0.004***  
##                                                                                                 (0.001)     (0.001)     (0.001)      (0.001)    
##   total.sulfur.dioxide                                                                                      -0.000      -0.000       -0.000     
##                                                                                                             (0.000)     (0.000)      (0.000)    
##   sulphates                                                                                                              0.590***     0.632***  
##                                                                                                                         (0.101)      (0.100)    
##   pH                                                                                                                                  0.684***  
##                                                                                                                                      (0.105)    
## ------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190      0.192       0.210       0.212       0.213       0.266       0.270       0.271       0.276        0.282   
##   adj. R-squared            0.190      0.192       0.210       0.211       0.212       0.265       0.269       0.269       0.274        0.280   
##   sigma                     0.797      0.796       0.787       0.787       0.786       0.759       0.757       0.757       0.754        0.751   
##   F                      1146.395    583.290     434.085     328.736     264.067     295.080     259.001     226.616     206.650      191.810   
##   p                         0.000      0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000        0.000   
##   Log-likelihood        -5839.391  -5831.127   -5776.812   -5771.696   -5769.463   -5598.011   -5582.291   -5582.183   -5564.958    -5543.767   
##   Deviance               3112.257   3101.773    3033.737    3027.406    3024.647    2820.137    2802.092    2801.969    2782.330     2758.359   
##   AIC                   11684.782  11670.255   11563.624   11555.391   11552.927   11212.022   11182.582   11184.367   11151.916    11111.534   
##   BIC                   11704.272  11696.241   11596.107   11594.371   11598.403   11263.995   11241.051   11249.333   11223.378    11189.493   
##   N                      4898       4898        4898        4898        4898        4898        4898        4898        4898         4898       
## ================================================================================================================================================

I had an m11 with citric.acid, but it added no value. As it stands, the fit does improve as variables are added, and every model’s overall p-value is effectively zero, but the R-squared (0.28 at best) is just too low to consider any of these models predictive.
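As a sanity check on how to read the AIC and BIC rows in the table above, both can be rebuilt from the log-likelihood. The sketch below uses the usual definitions (AIC = 2k - 2 logLik, BIC = k log(n) - 2 logLik), where k counts the regression coefficients plus one for the residual variance. The helper name `info_criteria` is my own.

```r
n_obs <- 4898

# Rebuild AIC and BIC from a model's log-likelihood
info_criteria <- function(log_lik, n_coef, n_obs) {
  k <- n_coef + 1  # coefficients (including intercept) + residual variance
  c(AIC = 2 * k - 2 * log_lik,
    BIC = log(n_obs) * k - 2 * log_lik)
}

# Log-likelihoods copied from the table above
m1  <- info_criteria(-5839.391, n_coef = 2,  n_obs = n_obs)  # intercept + alcohol
m10 <- info_criteria(-5543.767, n_coef = 11, n_obs = n_obs)  # intercept + 10 predictors

print(round(rbind(m1, m10), 3))
```

Lower is better for both criteria, and BIC penalizes each extra term more heavily than AIC at this sample size. That is why m8, which only adds total.sulfur.dioxide, has a higher AIC and BIC than m7.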

## 
## Calls:
## n1: lm(formula = quality ~ alcohol, data = wine)
## n2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## n3: lm(formula = quality ~ alcohol + residual.sugar + chlorides, 
##     data = wine)
## n4: lm(formula = quality ~ alcohol + residual.sugar + chlorides + 
##     volatile.acidity, data = wine)
## n5: lm(formula = quality ~ alcohol + residual.sugar + chlorides + 
##     volatile.acidity + free.sulfur.dioxide, data = wine)
## n6: lm(formula = quality ~ alcohol + residual.sugar + chlorides + 
##     volatile.acidity + free.sulfur.dioxide + sulphates, data = wine)
## n7: lm(formula = quality ~ alcohol + residual.sugar + chlorides + 
##     volatile.acidity + free.sulfur.dioxide + sulphates + pH, 
##     data = wine)
## 
## ====================================================================================================
##                           n1         n2         n3         n4         n5         n6         n7      
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           2.582***   2.021***   2.276***   2.464***   2.257***   2.044***   1.187***  
##                        (0.098)    (0.117)    (0.135)    (0.131)    (0.135)    (0.143)    (0.271)    
##   alcohol               0.313***   0.354***   0.339***   0.368***   0.374***   0.375***   0.374***  
##                        (0.009)    (0.010)    (0.011)    (0.011)    (0.011)    (0.011)    (0.011)    
##   residual.sugar                   0.022***   0.021***   0.026***   0.023***   0.023***   0.025***  
##                                   (0.002)    (0.003)    (0.002)    (0.002)    (0.002)    (0.003)    
##   chlorides                                  -2.062***  -0.907     -1.056     -1.076*    -0.939     
##                                              (0.556)    (0.540)    (0.539)    (0.538)    (0.538)    
##   volatile.acidity                                      -2.086***  -2.010***  -1.998***  -1.996***  
##                                                         (0.110)    (0.110)    (0.110)    (0.110)    
##   free.sulfur.dioxide                                               0.004***   0.004***   0.004***  
##                                                                    (0.001)    (0.001)    (0.001)    
##   sulphates                                                                    0.420***   0.365***  
##                                                                               (0.095)    (0.096)    
##   pH                                                                                      0.277***  
##                                                                                          (0.074)    
## ----------------------------------------------------------------------------------------------------
##   R-squared                0.190      0.202      0.204      0.259      0.265      0.267      0.270  
##   adj. R-squared           0.190      0.202      0.204      0.258      0.264      0.267      0.269  
##   sigma                    0.797      0.791      0.790      0.763      0.760      0.758      0.757  
##   F                     1146.395    619.354    418.558    427.455    351.981    297.659    257.788  
##   p                        0.000      0.000      0.000      0.000      0.000      0.000      0.000  
##   Log-likelihood       -5839.391  -5802.158  -5795.291  -5620.674  -5602.034  -5592.328  -5585.395  
##   Deviance              3112.257   3065.298   3056.715   2846.355   2824.773   2813.600   2805.646  
##   AIC                  11684.782  11612.317  11600.583  11253.347  11218.068  11200.655  11188.790  
##   BIC                  11704.272  11638.303  11633.066  11292.327  11263.544  11252.628  11247.259  
##   N                     4898       4898       4898       4898       4898       4898       4898      
## ====================================================================================================

I thought I might try taking out some of the variables that didn’t seem to affect R-squared much, but that only decreased the effectiveness of the linear model: even the largest reduced model, n7, reaches an R-squared of just 0.270.
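The step-by-step comparison in the table above can be reproduced roughly as follows. This is only a sketch: `wine_demo` is a small synthetic stand-in for the real `wine` data frame (the column names match the dataset, but the values and coefficients are made up), and the model names `d1` to `d3` are hypothetical. It uses base R rather than whatever table package produced the output above.

```r
# Synthetic stand-in for the wine data frame (NOT the real data); the values
# and coefficients below are invented for illustration only.
set.seed(42)
n <- 500
wine_demo <- data.frame(
  alcohol          = runif(n, 8, 14),
  residual.sugar   = runif(n, 1, 20),
  volatile.acidity = runif(n, 0.1, 0.6)
)
wine_demo$quality <- 3 + 0.3 * wine_demo$alcohol -
  2 * wine_demo$volatile.acidity + rnorm(n, sd = 0.8)

# Build nested models by adding one predictor at a time.
d1 <- lm(quality ~ alcohol, data = wine_demo)
d2 <- update(d1, . ~ . + residual.sugar)
d3 <- update(d2, . ~ . + volatile.acidity)
models <- list(d1 = d1, d2 = d2, d3 = d3)

# Collect the fit statistics side by side.
comparison <- data.frame(
  model         = names(models),
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(comparison)

# F-tests for whether each added predictor improves the fit.
print(anova(d1, d2, d3))
```

With the real data, swapping `wine_demo` for `wine` and extending the chain of `update()` calls should give the same kind of comparison. Adjusted R-squared and AIC are the rows to watch when deciding whether a variable earns its place.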

Multivariate Analysis

Coloring some of the graphs in the bivariate analysis section really highlighted the relationship between alcohol percentage and quality. I also created some nice visualizations of the relationships among the more closely related variables, such as free and total sulfur dioxide and sulphates, and density, residual sugar, and alcohol.

Further, I went on to try creating a linear model for quality, but none of the models created are particularly effective at predicting the quality of the wine. Using all the available variables created a model that explained only about 27% of the variance in quality (an R-squared of roughly 0.27), which isn’t high enough to consider using in a practical sense.
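To make the 27% figure concrete, here is a minimal sketch of what R-squared measures: the share of the variance in quality that the fitted model accounts for. The data frame `demo` is synthetic (not the wine data) and the coefficients are invented. The point is only that the hand-computed value matches what `summary()` reports.

```r
# Synthetic stand-in data (NOT the real wine data).
set.seed(1)
n <- 400
demo <- data.frame(alcohol = runif(n, 8, 14))
demo$quality <- 2.5 + 0.3 * demo$alcohol + rnorm(n, sd = 0.8)

fit <- lm(quality ~ alcohol, data = demo)

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((demo$quality - mean(demo$quality))^2)
r2_manual <- 1 - ss_res / ss_tot

# Root mean squared error: the typical size of a prediction miss,
# expressed in quality points.
rmse <- sqrt(mean(residuals(fit)^2))

cat("R-squared (manual): ", round(r2_manual, 3), "\n")
cat("R-squared (summary):", round(summary(fit)$r.squared, 3), "\n")
cat("RMSE:", round(rmse, 3), "\n")
```

The error in quality points is often easier to reason about than R-squared. In the table above, the sigma row (the residual standard error) plays that role: about 0.76 quality points for n7.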

Final Plots and Summary

Plot One:

This graph highlights the biggest issue with this dataset: there isn’t enough data at the extremes of the quality scale. I think to really get a better understanding of what makes a good-quality wine, either the scale needs to be continuous or we need more high- and low-quality wines.

Plot Two:

This plot does an excellent job of showing the relationship between alcohol percentage and quality. Perhaps not surprisingly, the more alcohol, the higher the quality.

Plot Three:

Finally, the third graph really shows the relationship between residual sugar, alcohol, and density. While this relationship does little to explain the quality of the wine, I thought this was a neat visualization of one of the physical aspects of wine.

Reflection

The wine dataset contained 4,898 observations of Portuguese Vinho Verde white wines. I started by getting a sense of the individual variables, before looking into their relationships with each other. As expected with any dataset, each variable had some large outliers. I built a correlation table both with and without the outlier rows, but there wasn’t an appreciable difference in correlations. For the most part, outside of alcohol, no variables were particularly correlated with quality. There were some variables that were well correlated with each other, such as alcohol and density, and free and total sulfur dioxide, but these tended to reflect actual physical relationships.
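The with-and-without-outliers comparison could be sketched like this. Everything here is an assumption made for illustration: `demo` is synthetic data rather than the wine data, `is_outlier` is a hypothetical helper, and the 1.5 * IQR boxplot rule is just one common cutoff, not necessarily the one used in this analysis.

```r
# Synthetic stand-in data (NOT the real wine data); residual.sugar is drawn
# from a skewed distribution so that it has a long tail of outliers.
set.seed(7)
n <- 1000
demo <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = rexp(n, rate = 1 / 6)
)
demo$quality <- 2.5 + 0.3 * demo$alcohol + rnorm(n, sd = 0.8)

# Flag values more than 1.5 * IQR outside the quartiles (the boxplot rule).
# This cutoff is an assumption, not the rule used in the original analysis.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Drop every row that is an outlier in at least one predictor.
predictors <- setdiff(names(demo), "quality")
outlier_row <- rowSums(sapply(demo[predictors], is_outlier)) > 0
trimmed <- demo[!outlier_row, ]

# Correlation of each predictor with quality, before and after trimming.
cor_compare <- data.frame(
  all.rows = sapply(demo[predictors], cor, y = demo$quality),
  trimmed  = sapply(trimmed[predictors], cor, y = trimmed$quality)
)
print(round(cor_compare, 3))
cat("Rows dropped:", sum(outlier_row), "of", nrow(demo), "\n")
```

If the two columns barely differ, as was the case here, the outliers are not what is driving the correlations.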

During my bivariate plotting, I did find some subtle relationships between the different variables and quality. For example, free and total sulfur dioxide tended to funnel towards an ideal amount for higher qualities. Higher-quality wines had less salt, volatile acid, and residual sugar. For the most part, however, there was no discernible relationship. Coloring the points by an additional variable highlighted this further.

For the most part, alcohol was the best indicator of wine quality, with more alcohol meaning higher quality. I’d like to joke that the wine judges liked wines that got you inebriated faster, but some cursory research showed that a higher alcohol content adds a lot more complexity to texture and taste. If anything, the variability of wines in the middle tier showed that wine quality is a very complex thing.

I tried creating a linear model to predict quality, but the results, though significant, had very low R-squared values, leading me to believe that the models might not hold up well in predicting a wine’s quality.

I think the biggest limitation of this dataset is that there isn’t enough data on very low- and very high-quality wines. If there were some more wines in those categories, then perhaps some of the relationships might be more revealing. Further, the data might require a more sophisticated modelling method to derive any meaningful results, and I don’t have the knowledge to do that quite yet. This has been fun, and it’s a lovely day here in Southern California, and now I really want to go have some wine!

Bibliography

Wine Quality info summary: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Vinho Verde website, for more on the wines used in the dataset: http://www.vinhoverde.pt/en/homepage

UC Davis Waterhouse Lab, “What’s In Wine”: http://waterhouse.ucdavis.edu/whats-in-wine

Sulfur Dioxide wikipedia: https://en.wikipedia.org/wiki/Sulfur_dioxide

Wine Mouthfeel and Texture: https://wine.appstate.edu/sites/wine.appstate.edu/files/Wine%20Mouthfeel%20and%20Texture.pdf